The fidelity of Generative Adversarial Network (GAN) inversion is impeded by Out-Of-Domain (OOD) areas (e.g., background, accessories) in the image. Detecting the OOD areas beyond the generation ability of the pretrained model and blending these regions with the input image can enhance fidelity. The "invertibility mask" identifies these OOD areas, and existing methods predict the mask from the reconstruction error. However, the estimated mask is usually inaccurate due to the influence of the reconstruction error in the In-Domain (ID) area. In this paper, we propose a novel framework that enhances the fidelity of human face inversion with a new module that decomposes the input image into ID and OOD partitions using invertibility masks. Unlike previous works, our invertibility detector is learned jointly with a spatial alignment module. We iteratively align the generated features to the input geometry and reduce the reconstruction error in the ID regions, so the OOD areas become more distinguishable and can be predicted precisely. We then improve the fidelity of our results by blending the OOD areas from the input image with the ID GAN inversion results. Our method produces photo-realistic results for real-world human face image inversion and manipulation. Extensive experiments demonstrate our method's superiority over existing methods in the quality of GAN inversion and attribute manipulation.
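To make the blending step concrete, here is a minimal sketch (assuming a soft mask in [0, 1]; the function name and tensor layout are illustrative, not the paper's implementation) of compositing OOD pixels from the input with ID pixels from the inversion result:

```python
import torch

def blend_inversion(input_img: torch.Tensor,
                    inverted_img: torch.Tensor,
                    ood_mask: torch.Tensor) -> torch.Tensor:
    """Composite OOD regions of the input with the ID inversion result.

    input_img, inverted_img: (B, 3, H, W) tensors in the same value range.
    ood_mask: (B, 1, H, W) soft mask in [0, 1]; 1 marks out-of-domain pixels
    (e.g., background, accessories) the generator cannot reproduce.
    """
    return ood_mask * input_img + (1.0 - ood_mask) * inverted_img
```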
In this work, we propose a self-supervised multi-agent system, termed a memory-like adaptive modeling multi-agent learning system (MAMMALS), that realizes online learning for behavioral pattern clustering of time series. In multi-agent learning, methods designed for individual agents typically perform poorly at the global level because they ignore the synergy between agents. MAMMALS instead encodes visual behaviors as discrete time series (DTS) and trains and models them across the multi-agent system in a bio-memory-like form. We implement a fully decentralized multi-agent system design framework and verify its feasibility in a surveillance-video application scenario: clustering vehicle paths.
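As a hedged sketch of the DTS encoding (the grid discretization and distance choice are assumptions for illustration, not the MAMMALS design), a vehicle path can be symbolized and compared against others for clustering:

```python
import numpy as np

def discretize_path(path: np.ndarray, grid: int = 8) -> np.ndarray:
    """Encode a (T, 2) trajectory of normalized (x, y) positions in [0, 1)
    as a discrete time series (DTS) of grid-cell symbols."""
    cells = np.clip((path * grid).astype(int), 0, grid - 1)
    return cells[:, 1] * grid + cells[:, 0]

def edit_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Levenshtein distance between two symbol sequences; the pairwise
    distance matrix can feed any standard clustering algorithm."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
            prev, d[j] = d[j], cur
    return int(d[-1])
```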
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain gains from increasing parameters and training data as ViTs do. Different from recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on the input and task information. As a result, InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data, as ViTs do. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieves a new record of 65.4 mAP on COCO test-dev. The code will be released at https://github.com/OpenGVLab/InternImage.
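For intuition about the core operator, the following is a minimal sketch of an input-conditioned deformable convolution built on torchvision's DeformConv2d; InternImage's actual operator (DCNv3) differs in details such as grouped aggregation and normalized modulation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """A 3x3 deformable convolution whose sampling offsets are predicted
    from the input, giving an input-adaptive receptive field."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel location.
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)  # start from a regular conv grid
        nn.init.zeros_(self.offset.bias)
        self.dconv = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dconv(x, self.offset(x))
```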
This paper investigates the task of 2D human whole-body pose estimation, which aims to localize keypoints on the entire human body, including the body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and address the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity to the searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.
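The zoom-in idea can be sketched with a generic RoI crop (an illustrative stand-in, not the exact ZoomNet module): small-scale parts are resampled to a fixed higher resolution before a dedicated sub-network refines their keypoints.

```python
import torch
from torchvision.ops import roi_align

def zoom_in(features: torch.Tensor, boxes: list,
            out_size: int = 96) -> torch.Tensor:
    """Crop part regions (e.g., hands, face) out of (B, C, H, W) feature
    maps, given per-image boxes as (K_i, 4) tensors of pixel coordinates
    (x1, y1, x2, y2), and resample them to one fixed, higher resolution so
    a dedicated sub-network can localize the small-scale keypoints."""
    return roi_align(features, boxes, output_size=out_size, aligned=True)
```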
Existing works on 2D pose estimation mainly focus on a certain category, e.g., humans, animals, and vehicles. However, there are many application scenarios that require detecting the poses/keypoints of unseen object classes. In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of objects given only a few samples with keypoint definitions. To achieve this goal, we formulate the pose estimation problem as a keypoint matching problem and design a novel CAPE framework, termed POse Matching Network (POMNet). A transformer-based Keypoint Interaction Module (KIM) is proposed to capture the interactions among different keypoints, as well as the relationship between the support and query images. We also introduce the Multi-category Pose (MP-100) dataset, a 2D pose dataset of 100 object categories containing 20K instances, which is well designed for developing CAPE algorithms. Experiments show that our method outperforms other baseline approaches. Code and data are available at https://github.com/luminxu/Pose-for-Everything.
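As a hedged illustration of the keypoint-matching formulation (a plain cosine-similarity matcher, omitting POMNet's transformer-based KIM):

```python
import torch
import torch.nn.functional as F

def match_keypoints(support_feats: torch.Tensor,
                    query_featmap: torch.Tensor) -> torch.Tensor:
    """support_feats: (K, C), one descriptor per support-annotated keypoint.
    query_featmap: (C, H, W). Returns (K, H, W) cosine-similarity heatmaps;
    the argmax of each heatmap is the predicted keypoint location."""
    C, H, W = query_featmap.shape
    q = F.normalize(query_featmap.reshape(C, -1), dim=0)   # (C, H*W)
    s = F.normalize(support_feats, dim=1)                  # (K, C)
    return (s @ q).reshape(-1, H, W)
```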
Unsupervised domain adaptation in semantic segmentation has been raised to alleviate the reliance on expensive pixel-wise annotations. It leverages a labeled source domain dataset as well as unlabeled target domain images to learn a segmentation network. In this paper, we observe two main issues of existing domain-invariant learning frameworks. (1) Being distracted by the feature distribution alignment, the network cannot focus on the segmentation task. (2) Fitting the source domain data well harms the target domain performance. To address these issues, we propose DecoupleNet to alleviate source domain overfitting and enable the final model to focus more on the segmentation task. Furthermore, we put forward Self-Discrimination (SD) and introduce an auxiliary classifier to learn more discriminative target domain features with pseudo labels. Finally, we propose Online Enhanced Self-Training (OEST) to contextually enhance the quality of pseudo labels in an online manner. Experiments show our method outperforms existing state-of-the-art methods, and extensive ablation studies verify the effectiveness of each component. Code is available at https://github.com/dvlab-research/decouplenet.
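Self-training on the target domain hinges on pseudo-label quality; a minimal sketch of confidence-thresholded pseudo-labeling (a generic scheme, not OEST's exact online enhancement) looks like:

```python
import torch

@torch.no_grad()
def make_pseudo_labels(logits: torch.Tensor, threshold: float = 0.9,
                       ignore_index: int = 255) -> torch.Tensor:
    """Turn target-domain predictions (B, num_classes, H, W) into pseudo
    labels, keeping only confident pixels and ignoring the rest."""
    conf, label = logits.softmax(dim=1).max(dim=1)   # both (B, H, W)
    label[conf < threshold] = ignore_index           # drop unreliable pixels
    return label
```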
Training a good supernet in one-shot NAS methods is difficult because the search space is usually considerably huge (e.g., $13^{21}$). To enhance the supernet's evaluation ability, one greedy strategy is to sample good paths, letting the supernet lean towards the good ones and easing its evaluation burden. However, in practice the search can still be quite inefficient, since the identification of good paths is not accurate enough and the sampled paths still scatter around the whole search space. In this paper, we leverage an explicit path filter to capture the characteristics of paths and directly filter out the weak ones, so that the search can be implemented more greedily and efficiently on a shrunken space. Concretely, based on the fact that good paths are far fewer than weak ones in the space, we argue that the label of "weak paths" is more confident and reliable than that of "good paths" in multi-path sampling. In this way, we cast the training of the path filter in the positive and unlabeled (PU) learning paradigm, and also encourage a path embedding as a better path/operation representation to enhance the identification capacity of the learned filter. By dint of this embedding, we can further shrink the search space by aggregating similar operations with similar embeddings, making the search more efficient and accurate. Extensive experiments validate the effectiveness of the proposed method, GreedyNASv2. For example, our obtained GreedyNASv2-L achieves 81.1% Top-1 accuracy on the ImageNet dataset, significantly outperforming the strong ResNet-50 baseline.
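For reference, here is a sketch of the PU objective such a path filter could be trained with, using the standard non-negative PU risk estimator (Kiryo et al., 2017); the class-prior value and the choice of weak paths as the positive class are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def nnpu_loss(pos_scores: torch.Tensor, unl_scores: torch.Tensor,
              prior: float = 0.1) -> torch.Tensor:
    """Non-negative PU risk for a binary path filter.
    pos_scores: filter logits for reliably labeled weak paths (positives);
    unl_scores: logits for unlabeled sampled paths; prior is the assumed
    fraction of weak paths in the unlabeled pool (a guess, tuned per space)."""
    bce = F.binary_cross_entropy_with_logits
    risk_pos = bce(pos_scores, torch.ones_like(pos_scores))
    risk_neg = (bce(unl_scores, torch.zeros_like(unl_scores))
                - prior * bce(pos_scores, torch.zeros_like(pos_scores)))
    return prior * risk_pos + risk_neg.clamp(min=0.0)  # non-negative correction
```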
The past few years have witnessed a great wave of technological innovation, marked by the progress of AI technologies, which is profoundly reshaping industry and society. However, down the road a key challenge awaits us: our ability to meet rapidly growing scenarios is severely limited by the cost of acquiring training data. This difficult situation stems from the limitations of the mainstream learning paradigm: we need to train a new model for each new scenario from scratch, based on a large amount of annotated data. In tackling this fundamental problem, we go beyond it and develop a new learning paradigm named INTERN. By learning with supervisory signals from multiple sources in multiple stages, the trained model yields strong generalizability. We evaluate our model on 26 well-known datasets covering four categories of tasks in computer vision. In most cases, our model, adapted with only 10% of the training data in the target domain, consistently outperforms its counterpart trained with the full set of data, often by a significant margin. This is an important step towards a promising prospect in which such a model with general vision capability can dramatically reduce the reliance on data, thus accelerating the adoption of AI technologies. Furthermore, revolving around our new paradigm, we also introduce a new data system, a new architecture, and a new benchmark, which together form a general vision ecosystem to support its future development in an open and inclusive manner.
Vision transformers (ViTs) inherited the success of NLP, but their structures have not been sufficiently investigated and optimized for visual tasks. One of the simplest solutions is to directly search for the optimal one via the widely used neural architecture search (NAS) from CNNs. However, we empirically find that this straightforward adaptation encounters catastrophic failures and frustrating instability in the training of the supernet. In this paper, we argue that since ViTs mainly operate on token embeddings with little inductive bias, the imbalance of channels across different architectures worsens the weight-sharing assumption and causes training instability. Therefore, we develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. Besides, we also propose identity shifting to alleviate the many-to-one issue in the supernet and leverage weak augmentation and regularization techniques to maintain more stable training. Based on these, our proposed method, ViTAS, achieves significant superiority in both DeiT- and Twins-based ViTs. For example, with only a 1.4G FLOPs budget, our searched architecture achieves 3.3% higher ImageNet-1k accuracy than the baseline DeiT. With 3.0G FLOPs, our results achieve 82.0% accuracy on ImageNet-1k and 45.9% mAP on COCO 2017, which is 2.4% superior to other ViTs.
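A hedged sketch of cyclic channel sharing for a supernet patch embedding follows; the strided index pattern is one possible instantiation of the idea, not necessarily the exact ViTAS mechanism:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CyclicSharedEmbedding(nn.Module):
    """Supernet patch embedding where a candidate width `dim` takes channels
    from the full weight by cycling over all of them, instead of always the
    first `dim`, so every channel is trained by many candidates."""

    def __init__(self, in_ch: int = 3, max_dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, max_dim, kernel_size=patch, stride=patch)
        self.max_dim = max_dim

    def forward(self, x: torch.Tensor, dim: int, offset: int = 0) -> torch.Tensor:
        # Spread the `dim` selected channels evenly over [0, max_dim).
        idx = (offset + torch.arange(dim, device=x.device)
               * (self.max_dim // dim)) % self.max_dim
        w, b = self.proj.weight[idx], self.proj.bias[idx]
        out = F.conv2d(x, w, b, stride=self.proj.stride)
        return out.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
```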
Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.